DOCUMENT RESUME

ED 344 905                                              TM 018 225

AUTHOR          Shaver, James P.
TITLE           What Statistical Significance Testing Is, and What It Is
                Not.
PUB DATE        Apr 92
NOTE            43p.; Paper presented at the Annual Meeting of the
                American Educational Research Association (San
                Francisco, CA, April 20-24, 1992).
PUB TYPE        Speeches/Conference Papers (150)
EDRS PRICE      MF01/PC02 Plus Postage.
DESCRIPTORS     Educational Research; Evaluation Problems; Hypothesis
                Testing; Probability; Psychological Studies;
                *Research Design; Research Problems; *Sample Size;
                *Statistical Significance; Test Validity
IDENTIFIERS     *Null Hypothesis; *Randomization (Statistics);
                Research Replication



ABSTRACT 

A test of statistical significance is a procedure for 
determining how likely a result is assuming a null hypothesis to be 
true with randomization and a sample of size n (the given size in the 
study). Randomization, which refers to random sampling and random 
assignment, is important because it ensures the independence of 
observations, but it does not guarantee independence beyond the 
initial sample selection. A test of statistical significance provides 
a statement of probability of occurrence in the long run, with 
repeated random sampling under the null hypothesis, but provides no 
basis for a conclusion about the probability that a particular result 
is attributable to chance. A test of statistical significance also 
does not indicate the probability that the null hypothesis is true or 
false and does not indicate whether a treatment being studied had an 
effect. Statistical significance indicates neither the magnitude nor 
the importance of a result, and is no indication of the probability 
that a result would be obtained on study replication. Although tests 
of statistical significance yield little valid information for 
questions of interest in most educational research, use and misuse of 
such tests remain common for a variety of reasons. Researchers should 
be encouraged to minimize statistical significance tests and to state 
expectations for quantitative results as critical effect sizes. There 
is a 58-item list of references. (SLD) 






WHAT STATISTICAL SIGNIFICANCE TESTING IS,
AND WHAT IT IS NOT



James P. Shaver 
Utah State University



Paper prepared for a symposium, Significance Testing in a Post-Positivistic Era:
Some Proposed Alternatives, with Comments from Journal Editors, at the annual
meeting of the American Educational Research Association, San Francisco, April
22, 1992.

Kenneth E. Bell, David G. Giisscxi, Perry J. Sailor, and Matt Taylor provided
helpful comments on the paper.




The use of tests of statistical significance in educational and
psychological research has been under attack for over 30 years. Among the
critics have been Skinner (1956), Bakan (1967), Meehl (1967), various authors
in Morrison and Henkel (1970), Seeman (1973), Signorelli (1974), and Cronbach
(1975). In 1978, Carver's excellent critique, "The Case Against Statistical
Significance Testing," was published. I touched on the matter as it related
to the lack of productivity of educational research in 1979, participated in
an American Educational Research Association symposium in 1980 with a title
similar to the one for which this paper was prepared, "Tests of Statistical
Significance: Readdressing Their Role", and then wrote a two-part article
cautioning educational practitioners about misinterpretations of statistical
significance (Shaver, 1985a, b).*

In 1978, Carver (p. 379) noted that all of the criticisms of tests of
statistical significance appeared to have had little effect. The situation
has not changed since then. A quick perusal of educational research journals,
educational and psychological statistics textbooks, and doctoral dissertations
will confirm that tests of statistical significance continue to dominate the
interpretation of quantitative data in educational research. Surely one
characteristic of statistical significance testing is that it is an enduring
(in the face of the devastating criticism, perhaps it would be better to say
relentless) phenomenon in educational and psychological research.

The thrust of this paper, like so many written before it, is that the
dominance of statistical significance testing is dysfunctional, because such
tests do not provide the information that many researchers assume they do.
Statistical significance testing also diverts attention and energy from more
appropriate strategies, such as replication and attention to the practical or
theoretical significance of results. In the hope that the accumulation of
criticism will have an effect, I respond again in this paper to the question
of what statistical significance testing is and what it is not. Possible
reasons for the persistence of statistical significance testing are also
discussed briefly, and proposals are presented for action by journal editors
to moderate the negative effects of statistical significance testing, if not
eradicate their inappropriate use.

* Much of this paper is a story told before. That has presented a quandary in
regard to how extensively to develop various concepts and to cite supporting
sources for ideas that seem well established, if not well accepted. I have
probably been both overbrief and excessive on both accounts at different
points in the paper.

What Statistical Significance Testing Is
A test of statistical significance is, at its very simplest in the
dominant Fisherian model of hypothesis testing, a procedure for determining
how likely a result is assuming a null hypothesis to be true. Somewhat more
precisely, our commonly used tests of statistical significance (z-ratios,
t-ratios, and F-ratios, such as in the analysis of variance or covariance) are
procedures for determining the probability (usually at a prespecified level
called alpha) of a result under the null hypothesis (assuming the null
hypothesis to be true) with randomization* and a sample of size n (i.e., the
sample size used in the study).

Individual elements of that statement are important, although often
overlooked. First, the result of a test of statistical significance is a
probability statement, often expressed as a dichotomy in terms of whether the
probability was less or greater than the alpha level. Second, the test is
based on the assumption that the null hypothesis is true. That is, the
theoretical sampling distributions against which results are compared (the
normal distribution, the t-distributions, the F-distributions, the chi-square
distributions) are generated by assuming that sampling occurs from a
population, or populations, in which the null hypothesis is true. Third,
despite some claims to the contrary (e.g., Thompson, 1987), randomization is a
fundamental assumption underlying the use of these tests of statistical
significance. Fourth, sample size is a crucial consideration, because the
statistical significance of a result will depend on the number of cases on
which it is based. Each of these elements will be alluded to in the
discussion that follows.

* I use randomization to include both random sampling and random assignment,
although some authors use randomization to refer only to assignment. Much of
the discussion that follows is in terms of random sampling; but, as I point
out later, random assignment also meets the randomness assumption.

Randomness as an Assumption

As Glass and Hopkins (1984) stated, "Inferential statistics is based on
the assumption of random sampling from populations" (p. 177). Elsewhere, they
refer to "random samples" as one of the "building blocks for hypothesis
testing" (p. 202), and they specify random sampling as an essential assumption
for the use of the one-sample t-test for a mean (p. 205), the t-tests for two
independent means (p. 231) and for the difference between means from
correlated observations (p. 241), and the F-ratios for differences between
variances (p. 261) and in the analysis of variance (e.g., pp. 342, 445).

Randomization is important because it helps to ensure the independence
of observations (or, equivalently, errors; Glass & Hopkins, 1984, p. 350).
Despite what is commonly assumed, however, randomness does not guarantee
independence beyond the initial sample selection. For example, observations
almost certainly will not be independent when treatments have been delivered
to subjects in a group setting, as is common in educational research.

In addition, randomization is essential to the typical tests of
statistical significance. Randomness (i.e., random error) is the basis for
the sampling distributions against which results are compared. Use of, for
example, a t-distribution to answer the question, "How likely is this
particular result under the null hypothesis?" will not yield a meaningful
probability statement if the sample is not random. Repeated random sampling
(or assignment) yields known sampling distributions. Nonrandom sampling does
not, nor does the comparison of a nonrandom sample to a randomly generated
sampling distribution provide a valid statement of probability of occurrence.

The indispensability of randomness may be more evident when the question
being addressed in tests of statistical significance is stated as, how
representative is this sample (sample statistic) of the population (population
parameter) as specified in the null hypothesis? Without randomness, that
question cannot be answered validly using the common tests of statistical
significance. As Glass and Hopkins (1984) put it:

     The method of random selection of samples will ensure, within a
     certain known margin of error, representativeness of the samples
     and hence will permit establishing limits within which the
     parameters are expected to lie with a particular probability.
     [...] (sampling error) is an important feature of a random sample
     [emphasis in the original] .... It is not possible to estimate
     the error with accidental sampling and many other sampling
     strategies since they contain unknown types and degrees of bias in
     addition to sampling error. (p. 177)




In that context, I found it baffling that Thompson (1987) would assert
that "significance testing imposes a restriction that samples must be
representative of a population, but does not mandate that this end must be
realized through random sampling" (pp. 8-9), and then go on to discuss
"comparing known sample characteristics with known population characteristics
to build some warrant for an assumption of representativeness" (p. 9). The
description of sample characteristics in order to allow generalization to
populations from which a random sample was not drawn is an important, and
often neglected, element of research reporting (Shaver & Norton, 1980a, b).
In fact, such description of sample characteristics is crucial, even if a
random sample was used, to assist readers in making generalizations, both
because the sample may not have been drawn from a population in which a
research user is directly interested and, equally important, because a random
sample may not represent well the population from which it was drawn.

Such descriptions are not, however, a substitute for random sampling;
the purpose of randomness is not to ensure representativeness (if that could
be done, there would be no need for an inferential test), but to allow the
specification of the probability that a sample came from a population with an
hypothesized parameter (or, conversely, to estimate a range of probable values
for a parameter). In short, random sampling addresses solely the
representativeness of samples in the long run; it does not ensure that all of
the characteristics of a particular sample, including the dependent
variable(s) under investigation, will be the same as those of the population,
only that (whether assessed or not) they will differ only by chance from the
population characteristics. Of course, this also means that in conventional
significance testing with random sampling from a population in which the null
hypothesis is true and with alpha set at .05, 5% of the time the researcher
will incorrectly conclude that the sample did not come from the population
specified in the null hypothesis; however, a conclusion that the sample was
not representative of the specified population (with α as the criterion) would
be correct.

In essence, the mistake is in viewing randomization as an outcome (i.e.,
representativeness) rather than as a process (i.e., sampling in which every
member of the population has an equal chance of being selected for the
sample). This error is not uncommon, and can even be found in statistics
books. For example, Ferguson and Takane (1989) provided an example of the use
of chi-square "to test the representativeness of a sample where certain
population values are known" (p. 218). They analyzed a set of data composed
of 200 individuals drawn (the process is not specified) from the population of
Montreal. The difference between population and sample frequencies for three
levels of national origin (French, English, and other) was statistically
significant at the .01 level. Ferguson and Takane concluded, erroneously,
"that the sample . . . cannot be considered a random sample" (p. 219). Of
course, the question of randomness is not a matter of a chi-square goodness of
fit test, but of the process by which the sample was drawn. Had they said
that the sample could not be considered a "representative" sample, their logic
would have been correct, although statistical significance has dubious
validity as the criterion for such a decision.

Random assignment. The term randomization is used somewhat ambiguously
in discussions of experimental design and tests of statistical significance.
Some (e.g., Hays, 1973, p. 562) use randomization to refer generally to the
application of random processes in designing experimental studies,
encompassing both random sampling and random assignment. On the other hand,
there are those who use randomization to refer only to random assignment to
treatments (e.g., Ferguson & Takane, 1989, p. 245; Winer, Brown, & Michels,
1991, pp. 7-8). Focusing on random sampling, and not discussing random
assignment, is also common. That is the case with Glass and Hopkins (1984),
who discuss tests of statistical significance in terms of random sampling but
not random assignment to treatments.

Although random assignment is not common in educational research, it is
more so than random sampling (Shaver & Norton, 1980a, b). As Berk and Brewer
(1978) pointed out, with random assignment, researchers can appropriately
compare their results against the sampling distributions commonly used in
tests of statistical significance (also see Hays, 1973, p. 562; Winer et al.,
1991, p. 8). Whereas random sampling ensures chance sample differences from
the source population on all characteristics, random assignment ensures that
differences between the groups on all variables, assessed or not, are
nonsystematic. Again, there is no assurance that the groups are not different
on any important variable. In fact, a test of statistical significance may
indicate, even after random assignment, that the groups are sufficiently
different on the variable(s) under analysis that, following the logic of the
inferential test, one should conclude that they did not come from the same
population.

Repeated random assignment to groups from the same population will
result in a sampling distribution of mean differences with a mean equal to
zero. The z, t, or F distributions can be used in tests of statistical
significance to determine the probability of a particular mean difference (or
difference in some other statistic) under the null hypothesis of no
difference. Of course, with only random assignment, a test of statistical
significance provides no basis for generalization to a specific population,
although it can be regarded as addressing the question whether the groups
under analysis can be regarded as samples from the same undefined population.
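The claim that repeated random assignment yields mean differences centered on zero can be illustrated with a small simulation. This is a minimal sketch under stated assumptions: the 20 scores, the group size of 10, the 5,000 repetitions, and the function name mean_differences are all hypothetical choices made for illustration, not values from the paper.

```python
import random

def mean_differences(scores, group_size, repetitions, seed=0):
    """Randomly assign the scores to two groups many times over and
    return the difference between group means for each assignment."""
    rng = random.Random(seed)
    pool = list(scores)
    diffs = []
    for _ in range(repetitions):
        rng.shuffle(pool)  # random assignment: every split equally likely
        group_a, group_b = pool[:group_size], pool[group_size:]
        diffs.append(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    return diffs

# Hypothetical scores for 20 available subjects (illustration only).
scores = list(range(1, 21))
diffs = mean_differences(scores, group_size=10, repetitions=5000)
average_difference = sum(diffs) / len(diffs)
largest_single_difference = max(abs(d) for d in diffs)

print(f"average of the mean differences: {average_difference:.3f}")
print(f"largest single mean difference:  {largest_single_difference:.3f}")
```

The average across assignments sits near zero, while individual assignments can still produce sizable differences, which is the distinction between the long run and a particular result.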

Just as random sampling will not ensure that a particular sample is
representative of the population from which it is drawn, random assignment
does not provide assurance that the resulting groups are identical with one
another or, put alternatively, that the groups are equivalent splits from the
same hypothetical infinite population (McHugh, 1964). Random sampling into
treatment groups addresses both the estimation of population parameters and
the likelihood of associations between treatment group membership and
preexisting characteristics; random assignment addresses only the latter.

It should be noted that, without random assignment, an exact probability
test of statistical significance can be based on a sampling distribution
generated by randomly splitting the available sample into all possible
combinations of the size of the groups in the study and computing the relevant
statistic for each combination. The researcher can then ask, using that
distribution, how likely it is that an obtained result would have occurred by
chance (e.g., Berk & Brewer, 1978; Winch & Campbell, 1969). Such probability
tests are rarely reported in the literature, however. Traditional tests of
statistical significance are typically applied, often ignoring the assumption
of randomization.
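The exact probability test described above can be sketched in a few lines. The data, the choice of the absolute difference between group means as the statistic, and the function name exact_randomization_p are assumptions made for illustration; the cited authors may have used other statistics.

```python
from itertools import combinations

def exact_randomization_p(group_a, group_b):
    """Enumerate every way of splitting the pooled scores into groups of
    the observed sizes and return (p, number_of_splits), where p is the
    proportion of splits whose absolute mean difference is at least as
    large as the one observed."""
    pooled = list(group_a) + list(group_b)
    n_a, n_b = len(group_a), len(group_b)
    total = sum(pooled)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)
    as_extreme = 0
    splits = 0
    # Work with indices so that tied scores are treated as distinct cases.
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        splits += 1
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            as_extreme += 1
    return as_extreme / splits, splits

# Hypothetical scores from two available (nonrandomized) groups.
p_value, n_splits = exact_randomization_p([14, 15, 16], [8, 9, 10])
print(f"{n_splits} possible splits, exact p = {p_value}")
```

With three cases per group there are only 20 possible splits, so the smallest attainable two-sided probability is 2/20; enumeration becomes costly as groups grow, which may be one practical reason such tests were rarely reported.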

Violations of randomness. Unfortunately, statistics textbook authors
tend to ignore the effects of violating the randomness assumption. For
example, Glass and Hopkins (1984), who are explicit about the importance of
random sampling if not random assignment, discuss the effects of violating the
assumptions of normal population distributions and homogeneous population
variances, but not the effects of lack of randomness on the appropriateness of
drawing a conclusion about a particular result using the theoretical sampling
distribution. The effect of lack of randomness on the independence of scores
is, however, often mentioned (e.g., Glass & Hopkins, 1984, p. 353).

One reason that randomness is often ignored may be that the examination
of the effects of violating that assumption is a formidable task because it
involves all sample-population characteristics, not only the dependent
variable as with the normality and homogeneity of variance assumptions. As I
have pointed out earlier (Shaver, 1980):

     To enumerate every potentially relevant variable and specify its
     relationship to the dependent variable(s), being certain that no
     crucial variable was overlooked, in order to investigate the
     effects of nonrandomness on probability statements presents
     insuperable difficulties. (p. 6)

Nevertheless, the general conclusion that levels of randomness can be
overlooked, as is common in the reporting of educational research, must be
challenged. As Winer et al. (1991) stated in discussing analysis of variance
assumptions:

     Violating the assumption of random sampling of elements from a
     population and random assignment of the elements to the treatments
     may totally invalidate any study, since randomness provides the
     assurance that errors are independently distributed [emphasis
     added] within and between treatment conditions and is also the
     mechanism by which bias is removed from treatment conditions. (p.
     101)

However, their table summarizing the "consequences of violation of assumptions
of the fixed-effects ANOVA" (p. 102) includes randomization only in terms of
the independence of observations (errors), perhaps because the other
consequences are unknown, and unknowable in practice.

An analogy. To sum up, the commonly used tests of statistical
significance provide the researcher with limited information: How likely is
this result, assuming the null hypothesis to be true and with randomization
(random sampling and/or assignment) and a sample of size n? Without
randomness, the result of the test of statistical significance is meaningless
or, at best, its relevance to a statement of probability is indeterminate.
Frequently in educational research, the researcher goes into a school or
schools, obtains available groups (with neither random sampling nor
assignment), collects data (sometimes with, and sometimes without, a
treatment) and then conducts tests of statistical significance. The results
of such inferential tests are essentially meaningless, unless one is
interested in comparisons to an abstract standard of probability as indicated
by the question: What would be the probability of the obtained result if
random samples had actually been drawn?

Consider an analogous situation: A person walks into a room and sees 10
coins lying on a table. He observes 8 heads and 2 tails, and wonders if the
coins are biased. So he asks if this particular arrangement of coins is a
likely chance occurrence. Having a statistics book handy, he turns to
Pascal's triangle and finds that the probability of obtaining 8 out of 10
heads is 45/1024 or .044. Because that probability is less than the
traditional .05 alpha level, he concludes that his result is not a likely
chance occurrence under the null hypothesis of a 50-50 split in tails and
heads, and that he has evidence that the coins are biased. However, what he
clearly should conclude is that if he had flipped each coin (or,
alternatively, had flipped 1 coin 10 times) the probability of the particular
result occurring by chance is less than 5%. But he has no evidence as to the
bias in the particular set of coins because they were not, as far as he knows,
flipped. That is, he does not know the process by which they arrived in their
positions. The theoretical (binomial) distribution provides only an abstract
standard of little relevance because the data were not produced in such a way
as to meet a basic assumption for use of the distribution. (Describing the
physical properties of the coins vis-a-vis biasedness would be a substitute
for flipping and use of the binomial distribution, not proof that the binomial
distribution was applicable.)
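The 45/1024 figure in the coin analogy can be checked directly from the binomial distribution. The sketch below assumes fair coins under the null hypothesis; the function name binomial_probability is hypothetical. The probability of 8 or more heads is not mentioned in the text and is included only as an added point of comparison.

```python
from fractions import Fraction
from math import comb

def binomial_probability(heads, flips, p_heads=Fraction(1, 2)):
    """Probability of exactly `heads` heads in `flips` independent flips."""
    return comb(flips, heads) * p_heads**heads * (1 - p_heads)**(flips - heads)

# The value read from Pascal's triangle in the analogy: exactly 8 of 10.
exactly_8 = binomial_probability(8, 10)

# Added comparison (not in the text): 8 or more heads out of 10.
at_least_8 = sum(binomial_probability(k, 10) for k in range(8, 11))

print(exactly_8, float(exactly_8))
print(at_least_8, float(at_least_8))
```

Either figure describes what flipping would produce in the long run; neither says anything about coins that were simply found lying on a table.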

Just as application of the binomial distribution could not provide valid
information about possible bias in the observed coins, so educational
researchers who use nonrandomized groups cannot obtain valid information about
the probability of a group difference under the null hypothesis using a common
test of statistical significance. The t-distribution, for example, has no
more relevance to differences between available groups than the binomial
distribution does to the possible bias in coins found lying on a table.

What Statistical Significance Testing Is Not

A test of statistical significance used without randomization, then,
does not yield valid information about the probability of a result under the
null hypothesis. The following brief listing of what tests of statistical
significance cannot do for the researcher is, therefore, based on the
assumption that data come from a design that includes randomization, either
random sampling or random assignment. As noted above, with randomization, a
test of statistical significance provides a researcher with information on the
probability of a result assuming the null hypothesis to be true and given the
sample size. On the other hand, a test of statistical significance does not
provide information on a number of matters of interest to researchers, even
though it is often presumed to do so.

A test of statistical significance provides a statement of probability
of occurrence in the long run, with repeated random sampling (or assignment)
under the null hypothesis. As Carver (1978) argued, it is a fantasy to
believe that such a test speaks to whether a particular result is a chance
occurrence. That is, a test of significance provides the probability of a
result occurring by chance in the long run under the null hypothesis with
random sampling and sample size n; it provides no basis for a conclusion about
the probability that a particular result is attributable to chance.

Even with random samples drawn from a population in which the null
hypothesis is true and in which scores are distributed according to the
assumptions for the statistical model, with alpha set at .05, 5% of the time
the researcher will conclude that a result is not a likely occurrence under
the null hypothesis, thus making a Type I error. But the researcher has no
way of knowing for any particular result whether that error is being made or
if the sample was drawn from a population in which the null hypothesis was not
true. That is why, according to Tukey (1969), R. A. Fisher's "standard of
firm knowledge was not one very extremely significant result, but rather the
ability to repeatedly get results significant at 5%" (p. 85). Replication is
essential to confidence in the reliability (reproducibility) of a result, as
well as to conclusions about generalizability (external validity) (e.g.,
Campbell & Jackson, 1979).
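The long-run character of the 5% figure can be illustrated by simulation. This is a sketch under assumptions not taken from the paper: a normal population with mean 0 and known standard deviation 1, samples of 25, a two-sided z-test, and 10,000 repetitions. The function name type_i_error_rate is hypothetical.

```python
import math
import random

def type_i_error_rate(n=25, repetitions=10000, seed=1):
    """Draw repeated random samples from a population in which the null
    hypothesis (mean = 0) is true and return the proportion of samples
    declared statistically significant at the .05 level."""
    rng = random.Random(seed)
    critical_z = 1.959964  # two-sided .05 critical value, standard normal
    rejections = 0
    for _ in range(repetitions):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        z = (sum(sample) / n) * math.sqrt(n)  # known sigma = 1, null mean = 0
        if abs(z) > critical_z:
            rejections += 1
    return rejections / repetitions

rate = type_i_error_rate()
print(f"proportion of samples 'significant' under a true null: {rate:.3f}")
```

The proportion settles near .05 across many samples, but nothing in any single sample reveals whether it is one of the erroneous rejections.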
What About H0?

A test of statistical significance does not indicate the probability
that the null hypothesis is true or false. It provides the researcher with
information in regard to the likelihood of a result given that the null
hypothesis is true; it does not indicate the likelihood that the null
hypothesis is true given a particular result. Carver (1978) and J. Cohen
(1990) are among those who have cautioned against that fallacy.

Unfortunately, authors of statistics books and research reports
frequently make statements about rejecting the null hypothesis based on one
statistically significant result. That is too absolute a conclusion. If any
inference about the null hypothesis is to be drawn from one test of
statistical significance, it should be stated in terms of evidence for the
plausibility of the null hypothesis, not an absolute rejection. Even though
the "rejection" decision may be set implicitly in a probabilistic context
(with, e.g., a .05 chance of a Type I error) that qualification is, typically,
quickly ignored.

Conversely, a test of statistical significance also does not provide
information on the probability that an alternative hypothesis is true or false
(see, e.g., Carver, 1978). The sampling distribution is based on the null
hypothesis; so no evidence is provided as to the likelihood of the result
occurring under alternative hypotheses.
What About Treatment Effects?

One of the more egregious errors is to conclude that a test of
statistical significance indicates whether a treatment being studied had an
effect. Clearly, the test of statistical significance addresses only the
simple question whether a result is a likely occurrence under the null
hypothesis with randomization and a sample of size n. At most (as noted
above), a statistically significant result has to do with the probability of
the result in the long run (i.e., over repeated samples), not with whether the
particular result did or did not occur under the null hypothesis.

It seems terribly obvious that a test of statistical significance does
not speak directly to causality. Even if a researcher were willing to
conclude, following a test of statistical significance, that a low probability
result did not come from a population in which the null hypothesis is true
(i.e., the result was not a chance occurrence under H0), and did not make a
Type I error in doing so, the test of statistical significance provides no
evidence as to the cause of the result. That is a matter of design, not of
statistical inference. Not even the selection threat to internal validity is
perfectly controlled by random sampling or assignment; simply by chance
pretreatment groups can be significantly (statistically and/or practically)
different on one or more relevant variables.

The fallacy of concluding that a statistically significant result
indicates a treatment effect (as often seen, for example, in statements such
as, "the statistically significant difference between means indicates that
Treatment A was more effective than Treatment B") is likely perpetuated by
statistics books in which it is suggested that tests of statistical
significance will address questions such as: "Is the treatment effective?
Does Drug X reduce hyperactivity more than a placebo? Does anxiety level
influence test performance?" (Glass & Hopkins, 1984, p. 230).

Statements such as, "the mean practice effect was highly significant"
(Glass & Hopkins, 1984, p. 242), following a t-test comparing pre- and
posttest means, can mislead as well, as can a statement that a confidence
interval for a difference between means indicates the range of values that
encompass "the true treatment effect (μ1 - μ2)" (p. 236). The ambiguity of
the term effect as used in inferential statistics, especially analysis of
variance, to refer to a comparison (e.g., a "main effect") does not help. As
J. Cohen and P. Cohen (1983) pointed out, the "causal implication of the term
effect" can obscure the fact that "causal interpretations are never warranted
by statistical results, but require logical and substantive bases" (p. 210).

Nontextbook discussions to inform researchers about inferential
statistics can misinform, as well. For example, J. Cohen (1990) commented
that "everyone knows that . . . all [statistical significance] means is that
the effect is not nil . . ." (p. 1307). And, in a generally sound piece, Berk
and Brewer (1978) said: "If this null hypothesis [μ1 = μ2] is rejected, it can
be concluded that presence or absence of the specific diploma treatment
contributes to group differences in income" (p. 209). Such statements create
and perpetuate an erroneous view of the relationship between tests of
statistical significance and causality.
What About Magnitude and Importance?

Despite frequent conclusions to the contrary in research reports (as
noted, e.g., by Bracey, 1991, and Harcum, 1989), statistical significance
indicates neither the magnitude nor the importance of a result. Statistical
significance is only information in regard to the probability of a result
under the null hypothesis with randomization and a sample size of n.

Sample size is, of course, a primary concern in this particular instance
of what a statistical significance test is not. For example, with n = 10 and
α = .05 for a nondirectional test, a correlation of .63 is statistically
significant; with n = 50, an r = .28 is needed; with n = 100, an r = .20; with
n = 500, an r = .09; with n = 1,000, an r = .06; and, with n = 10,000, an r =
.02 is statistically significant at the .05 level (Glass & Hopkins, 1984, p.
549). Alternatively, with a standard deviation of 10 and n = 20, a difference
of 9.4 between 2 independent means is necessary for statistical significance
at the .05 level in a nondirectional test; with n = 100, a difference of only
4.0 is required, and with n = 1,000, a difference of only 1.2 is required.
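
The thresholds quoted above can be checked with a short calculation. The
following is only a sketch under stated assumptions, not the procedure used
by Glass and Hopkins: it assumes the usual t test for a correlation (df =
n - 2) and for two independent means, it reads the n in the mean-difference
example as the total sample split into two equal groups, and it obtains the
critical t by numerically integrating the t density rather than by reading a
table. The function names are hypothetical.

```python
import math


def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x: float, df: int, steps: int = 400) -> float:
    """P(T <= x) for x >= 0, by composite Simpson integration from 0 to x."""
    if x == 0:
        return 0.5
    h = x / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3


def t_critical(df: int, alpha: float = 0.05) -> float:
    """Critical t for a nondirectional test, found by bisection on the CDF."""
    lo, hi = 0.0, 20.0
    target = 1 - alpha / 2
    for _ in range(50):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def critical_r(n: int, alpha: float = 0.05) -> float:
    """Smallest |r| that is statistically significant with n pairs."""
    df = n - 2
    t = t_critical(df, alpha)
    return t / math.sqrt(t * t + df)


def critical_mean_difference(sd: float, n_total: int, alpha: float = 0.05) -> float:
    """Smallest mean difference that is statistically significant.

    Assumes two equal groups of n_total / 2 and a common standard deviation.
    """
    standard_error = sd * math.sqrt(4 / n_total)
    return t_critical(n_total - 2, alpha) * standard_error


if __name__ == "__main__":
    for n in (10, 50, 100, 500, 1000, 10000):
        print(f"n = {n:>6}: critical r = {critical_r(n):.2f}")
    for n in (20, 100, 1000):
        print(f"n = {n:>6}: critical difference = "
              f"{critical_mean_difference(10, n):.1f}")
```

Under these assumptions the thresholds shrink roughly in proportion to the
square root of n, which is the point of the paragraph above: the same
criterion admits ever smaller results as the sample grows.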

Obviously, very small and trivial results as well as important ones may
be statistically significant. As Meehl (1967), along with a host of other
writers, has pointed out, with a large enough sample and reliable assessment,
practically every association will be statistically significant. Conversely,
with a very small sample, very few results will be statistically significant.
Therefore, to know only whether a result is statistically significant tells
one virtually nothing about the magnitude or importance of the result.

It is so commonly stressed that the statistical significance of results
is directly a function of sample size that one can only wonder at the number
of articles in which results are either interpreted as important because of
statistical significance or in which the probability level appears to be taken
as an indication of magnitude, as suggested by the use of terms such as
"highly significant" when the probability is .01 or less. Even statistics
textbook authors make the latter mistake, as in Glass and Hopkins' (1984)
references to a "mean practice effect [that] was highly significant" (p. 242)
and a "multiple correlation coefficient [that] is highly significant" (p. 314)
when the probabilities were .001 or less.

Effect sizes. Statistical probability is not, then, a useful indicator
of magnitude of a result because it is dependent on sample size. This
deficiency became especially clear in efforts to prepare quantitative
literature summaries. Glass (1976) proposed that effect sizes*, metrics for
the magnitude of results that are independent of sample size and scale of
measurement, be used in reporting results.
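
As an illustration of what independence from the scale of measurement means,
the sketch below computes one common effect size, the standardized mean
difference with a pooled standard deviation, for the same scores before and
after a change of scale. The scores and the function name are hypothetical,
invented only for this example.

```python
import math
from statistics import mean, variance


def standardized_mean_difference(group_a, group_b):
    """Mean difference divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_variance = ((n_a - 1) * variance(group_a)
                       + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_variance)


# Hypothetical scores for two groups.
treated = [52, 55, 61, 58, 64, 57]
control = [50, 49, 56, 53, 59, 51]

d_raw = standardized_mean_difference(treated, control)

# The same scores on a different scale (multiply by 10 and add 3).
d_rescaled = standardized_mean_difference(
    [10 * x + 3 for x in treated],
    [10 * x + 3 for x in control],
)

if __name__ == "__main__":
    print(f"d on the original scale:  {d_raw:.3f}")
    print(f"d on the rescaled scores: {d_rescaled:.3f}")
```

The two values agree because the rescaling multiplies the mean difference and
the pooled standard deviation by the same factor. Nothing in the calculation
speaks to whether a difference of that size matters.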

Effect sizes, too, may be misinterpreted, however. Despite J. Cohen's
(1988, pp. 12-13) cautions to the contrary, researchers have ignored the
arbitrariness of his conventions for low, medium, and large effect sizes
(e.g., .2, .5, and .8, respectively, for standardized mean differences). Yet,
as has been made clear by a number of authors (e.g., Glass, McGaw, & Smith,
1981, p. 104; Shaver, 1985b, 1991), an effect size of 1 or larger may reflect
a trivial result. The dependent variable may lack value (benefit), the
construct or characteristic may not have been validly assessed (Messick,
1989), or the result may be too costly to produce or its reliability may be in
doubt. Substituting sanctified effect size conventions for the sanctified .05
level of statistical significance is not progress.

Power analysis. One reaction to the relationship between sample size
and statistical significance has been the call for statistical power analysis.
J. Cohen (1988), among others, has indicted the low power (that is, the
probability of obtaining a statistically significant result if there is a
difference in the population) of much psychological and educational research
due to the small sample sizes typically used and the small to moderate effect
sizes that are common.

* Result size would be a better term to avoid cause-effect implications. But
effect size is probably too firmly embedded in the educational research usage
to change now (Shaver, 1991).

In conducting a power analysis, the population value is not known
(otherwise, a test of statistical significance would be irrelevant); an effect
size must be estimated, either the population value or the minimum result that
would indicate practical significance. Once this is done, the researcher can
manipulate sample size, alpha level, whether the alternative hypothesis is
directional or nondirectional, and even the magnitude of the estimated effect
size to obtain a desired level of power. All seem to be rather meaningless
exercises, an intellectual game, once the effect size of interest has been
specified (a point to which I will return). The concern should be whether an
anticipated effect size is obtained, not how to manipulate design and analysis
elements so that the result, if obtained, will be statistically significant.
To focus attention on the appropriate issue, S. A. Cohen and Hyman (1981) have
insisted that their doctoral students specify an effect size, not just an
alpha level, as the criterion against which to judge results.
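
The manipulations just described can be made concrete with a rough sketch.
It assumes a comparison of two equal-sized independent groups and uses a
large-sample normal approximation to the power of the test, not the exact
noncentral t calculation behind published power tables. The function name and
the particular values tried are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()


def approximate_power(effect_size, n_per_group, alpha=0.05, directional=False):
    """Approximate power for comparing two independent means.

    effect_size is the assumed standardized mean difference in the population.
    """
    expected_z = effect_size * sqrt(n_per_group / 2)
    if directional:
        z_critical = STANDARD_NORMAL.inv_cdf(1 - alpha)
        return 1 - STANDARD_NORMAL.cdf(z_critical - expected_z)
    z_critical = STANDARD_NORMAL.inv_cdf(1 - alpha / 2)
    upper_tail = 1 - STANDARD_NORMAL.cdf(z_critical - expected_z)
    lower_tail = STANDARD_NORMAL.cdf(-z_critical - expected_z)
    return upper_tail + lower_tail


if __name__ == "__main__":
    baseline = approximate_power(0.5, 20)
    print(f"d = .5, n = 20 per group, alpha = .05:  {baseline:.2f}")
    print(f"larger sample (n = 64 per group):       "
          f"{approximate_power(0.5, 64):.2f}")
    print(f"larger alpha (.10):                     "
          f"{approximate_power(0.5, 20, alpha=0.10):.2f}")
    print(f"directional alternative:                "
          f"{approximate_power(0.5, 20, directional=True):.2f}")
    print(f"larger assumed effect size (d = .8):    "
          f"{approximate_power(0.5 + 0.3, 20):.2f}")
```

Each of the four adjustments raises the computed power without changing the
result the study will actually produce.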

Something else which a test of statistical significance is not, is an
indication of the probability that a result would be obtained upon replication
of the study. A test of statistical significance yields the probability of a
result occurring under the null hypothesis, not the probability that the
result will occur again if the study is replicated. Carver's (1978) treatment
should have dealt a death blow to this fallacy, too.

With the randomization model, one has no way of knowing how close a
particular result is to the population parameter. The more extreme a
statistic is in the sampling distribution, the less likely it is to be
reproduced upon replication. That is, with the continued drawing of random
samples (and especially with continued random assignment when not conjoined
with random sampling, so there is lack of control of the hypothetical
populations from which the assignments are being made), there could be
considerable fluctuation in the statistics obtained. Moreover, sampling
(i.e., random) error is compounded by the experimental circumstances that make
it difficult in educational settings to implement a previous design with
fidelity, including the valid reproduction of a treatment, across
replications. These difficulties, coupled with the limited information from a
randomness-based probability statement, are why statistical significance does
not indicate the reliability, or replicability, of a result. Unfortunately,
the contrary assumption has been common, encouraging a one-shot approach to
research (Shaver, 1979).

Statistical significance not only provides no information about the
probability that replications of a study would yield the same result, but is
of little relevance in judging whether actual replications yield similar
results. Similar probabilities could be based on quite different results and
different probabilities could be based on identical results (see Rosenthal,
1991). For that reason, a recommendation that statistical significance and
the statistical power of tests of statistical significance, in addition to
effect sizes, be reported in replications (Rosenthal, 1991) makes little
sense: The question of interest is whether an effect size of a magnitude
judged to be important has been consistently obtained across valid
replications. Whether any or all of the results are statistically significant
is irrelevant.
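
The point that probabilities and results can part company is easy to show
numerically. The sketch below is an illustration only: it assumes two
equal-sized independent groups and a large-sample normal approximation to the
test of a mean difference, and the effect sizes and sample sizes are
hypothetical values chosen for the example.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(effect_size, n_per_group):
    """Approximate nondirectional p value for a standardized mean difference."""
    z = effect_size * sqrt(n_per_group / 2)
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Identical results, different probabilities.
same_result_small_n = two_sided_p(0.5, 10)
same_result_large_n = two_sided_p(0.5, 100)

# Quite different results, essentially the same probability.
large_result_small_n = two_sided_p(0.8, 12)
small_result_large_n = two_sided_p(0.2, 192)

if __name__ == "__main__":
    print(f"d = .5, n = 10 per group:  p = {same_result_small_n:.4f}")
    print(f"d = .5, n = 100 per group: p = {same_result_large_n:.4f}")
    print(f"d = .8, n = 12 per group:  p = {large_result_small_n:.4f}")
    print(f"d = .2, n = 192 per group: p = {small_result_large_n:.4f}")
```

A replication judged by its probability level alone would therefore be
called a failure in the first pair and a success in the second, which is the
reverse of what the effect sizes show.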

Persistence of Tests of Statistical Significance
Tests of statistical significance, then, yield little valid information
pertinent to the questions of interest in most educational research.* As I
noted 12 years ago (Shaver, 1980), the continued reliance on inferential
statistics in educational research in the face of the many criticisms would
surely make an excellent study in intellectual history and the sociology of
ideas. There is, undoubtedly, a complex web of causal factors; it is an
oversimplification to claim that the fault lies with journal editors who
discourage the submission of reports of statistically nonsignificant results
and of replications and who do not accept articles in which quantitative data
are reported without tests of statistical significance.

* Projects, such as the National Assessment of Educational Progress, in which
random sampling techniques are used in order to estimate population parameters
are notable exceptions.
Errors of Omission

There are, undoubtedly, many subtle factors involved in the continued
use and misuse of tests of statistical significance. For example, although I
have yet to see an article in which it is argued that a test of statistical
significance does more than provide information on the likelihood of a result
occurring under the null hypothesis, an error of omission is common. That is,
there is a tendency to not remind readers of the central position of random
sampling in the logic of the commonly used tests of statistical significance
(e.g., J. Cohen, 1990) or to mention randomness at one point in the
discussion, but ignore it elsewhere (e.g., Berk & Brewer, 1978, pp. 190-191;
Carver, 1978, pp. 381-382, 385; J. Cohen, 1990, e.g., pp. 1307, 1310; S. A.
Cohen & Hyman, 1980). Although the authors may understand that randomness is
central to the meaningful interpretation of tests of statistical significance,
many readers will miss that point when it is omitted from much of the
discussion.
Statistics Courses

That educational researchers might not read "with randomization" into a
statement such as, "What [statistical significance] tells us is the
probability of the data, given the truth of the null hypothesis . . ." (J.
Cohen, 1990, p. 1307), is not surprising in light of the training that many
receive. Statistics textbooks are sometimes the source of "myths and
misconceptions" that lead researchers to overinterpret inferential statistics
(Brewer, 1985). Moreover, statistics courses, like textbooks, are frequently
geared almost solely to helping students recognize the types of analysis
situations to which various inferential statistics are applicable and to
interpreting the results in terms of statistical significance. They are not
aimed at encouraging students to question the uses or usefulness of tests of
statistical significance, much less to reflect on the role of random sampling
in their inferential tests.

The tendency for statistical over-analysis and unthinking interpretation
is being exacerbated by the growing availability, with personal computers in
practically every office, of black-box statistical solutions: The researcher
puts data in and receives output, often without understanding the assumptions
underlying the analysis, and is encouraged in this activity by a current
emphasis on data analysis with high-speed computers at the highest possible
level of statistical complexity. J. Cohen's (1990) caution that "simpler is
better" is refreshing advice that will likely be ignored with complex data
analyses so easily accomplished. A possible cause is the education that many
educational researchers receive. Their statistics courses in particular are
not philosophical in orientation. They are trained to be mechanical appliers
of statistical techniques, rather than educated to be thoughtful, critical
users (Dar, 1987).
Graduate Committees

Graduate students' training is often reinforced by their experiences
with their graduate committees. Faculty often insist on the application of
tests of statistical significance when they are not appropriate, such as in
the situation where the student will collect data on a total population. As
alleged by Brewer (1985, p. 246), there is also an unthinking insistence by
graduate committees on the importance of results statistically significant at
the .05 level. The myth persists that only results that are statistically
significant at or beyond the .05 level are important, and that "beyond" means
"of greater magnitude" and/or "of greater importance." Graduate students are
frequently anxious that they might not obtain statistically significant
results and, consequently, their research will not be acceptable to their
committees. Misunderstanding of the meaning of statistical significance is
perpetuated, along with misapprehension of the importance of negative (no
difference) findings in the production of knowledge, a point for discussion on
another day.
Being "Scientific" 

Among the reasons for the insistence on tests of statistical
significance is that they provide a facade of scientism in research. For many
in educational research, being quantitative is equated with being scientific.
Numbers, perhaps, give a sense of security and of seemingly firm results,
especially when researchers confuse the numbers with reality (Shaver, 1991,
p. 94). An extension is the assumption that mathematics, including
inferential statistics, are essential to science, despite the fact that some
scientists (for example, physicists) and many outstanding psychologists (such
as Wundt, Piaget, Lewin, and Skinner) have managed very well without
inferential statistics (J. Cohen, 1990, p. 1311). Unfortunately, attention is
distracted from the serious consideration of educational research as science,
including the role of replication (Shaver, 1979).

The use of statistics has been criticized as ritualistic, because it is
so often unthinking practice (e.g., J. Cohen, 1990; Seesnan, 1973; aaver,
1987), and even compared to religious ritual (Salsburg, 1985). What Salsburg
noted in regard to "the religion of statistics" in the medical profession
applies equally well to educational research. There are the high priests,
those who have the secrets to the mysteries of God, salvation, and eternal
life (of statistical inference) and they are not lightly or easily challenged.
The religious practitioners go to church (start up their computers) and go
through their rituals (compute their tests of statistical significance) hoping
for salvation (findings at the .05 level). Overall, a sense of security
pervades; the practitioners have confidence in the high priests' access to the
underlying mysteries, allowing them to avoid dealing with the foreboding
uncertainties of the meaning of life (or of the meaning of statistical
analyses).

The term "mystery" is particularly significant in this discussion
because, to many educational users, tests of statistical significance really
are mystical: the probability level is a magical indication of whether results
are important, that is, have achieved the sacred .05 level of statistical
significance. And, just as people are often raised to accept religious
affiliation and practice, rather than encouraged to question underlying
tenets, so do statistics courses and other graduate education experiences, as
noted above, often foster the unthinking acceptance of statistical
significance practices and over-interpretations rather than challenging their
validity and usefulness. The acceptance of the ritual diverts attention from
important matters such as the role of replication in knowledge verification,
the reporting of effect sizes to indicate the magnitude of results, and the
weighing of benefits and costs in making judgments about the educational
significance of results.

Considerable social pressure also contributes to the continued
unthinking use of tests of statistical significance. I can attest, as I am
sure many readers of this paper can, to the subtle as well as overt pressures
that have led me to have my own doctoral candidates report tests of
statistical significance when they were basically meaningless additions to
interpretation (a practice I have abandoned). Rejecting the use of statistics
is likely to be particularly threatening (that is, it raises fears of
rejection by, even humiliation at the hands of, academic peers) for those who
do not feel comfortable with mathematics or with their philosophical
understanding of the logic of statistics.

Such influence is present in the broader research domain as well. Glass
et al. (1981, pp. 197-199) discussed why inferential statistics are not
appropriate for meta-analyses in which the reviewers are analyzing data from
populations or near populations of studies. They noted, however, that Tukey
had "chided them for not presenting standard errors of the more important
averages," despite their reasons for not doing so. He maintained that
"regardless of such complications, some rudimentary inferential calculations
would be informative and useful." After that revelation, Glass et al.
presented a lengthy discussion of the application of inferential statistics in
meta-analysis.

Another part of the social context is professors who teach inferential
statistics courses and researchers whose reputations have depended on
statistically significant results. Both have psychological investments, as
well as personal stakes, in the continued legitimacy of tests of statistical
significance. Carver (1978) was correct: The influence of statistical
significance testing will not be easily diminished because "too many have a
vested interest in it" (p. 397).
Paradigmatic Effects

Breaking out of ritual, then, is not only a problem for those at the
lower levels of statistical practice. As Thompson (1987, 1989) noted,
paradigms, including accepted ways of thinking about research, are difficult
to examine. They come to be taken for granted as natural thought and they
carry normative implications for what is appropriate thinking. Paradigms
dominate the thinking of the high priests as well as their followers, in
science as well as in educational research, and are resistant to change in
all.

As Lightman and Gingerich (1992) noted, there is in science a strong
preference for "explanations that are mechanistic, logical, and calculable"
(p. 694), certainly an apt description of tests of statistical significance.
Lightman and Gingerich also suggested that even in the face of anomaly,
scientists (and, one might assume, educational researchers) are conservative,
"reluctant to change their explanatory frameworks" (p. 694). The explanations
for this inertia include social and cultural factors (as noted above and as
discussed by Barber, 1961), such as the force of shared beliefs and
commitments, the influence of and desire for professional prestige, resistance
to "nonspecialists", and adherence to schools of thought, and psychological
factors, such as being comfortable with the familiar and desiring to avoid the
discomfort of cognitive dissonance (Festinger, 1957).

The resistance to paradigmatic change may be so strong that facts are
not even acknowledged as anomalous and thus a reason for change. Tests of
statistical significance seem to fit in that category. Despite all of the
critiques, the unpleasant facts about the limited applicability of tests of
statistical significance seem not to be recognized as anomalous with their
use; thus, the need to upset established practices is avoided.

As one example of a paradigmatic influence in educational research,
Thompson (1987, p. 4) cited the unthinking use of analysis of variance, which
leads individuals to think in terms of the statistical significance of
differences between or among means, instead of regression techniques that
focus attention on relationships. Other examples of what appear to be
paradigmatic influence are not too difficult to find. I noted above the
continued inappropriate application of inferential statistics in meta-analysis
when the reviewer has not sampled from a population of studies, but has done
an exhaustive literature search. Sophisticated discussion of the computation
of standard errors and establishing confidence intervals in meta-analysis
(e.g., Hedges & Olkin, 1985; Rosenthal, 1984; Wolf, 1986) and the use of those
inferential statistics in reporting meta-analytic results (e.g., Schlaefli,
Rest, & Thoma, 1985) seem to me to be the result of a relentless paradigmatic
influence (Shaver, 1991).

Power analysis again. Another example comes from power analysis. Even
those who are vocal critics of tests of statistical significance seem unable
to shake loose from the nagging thought that tests of statistical significance
might have more meaning than they accord to them in their criticisms. S. A.
Cohen and Hyman (1981) strongly attacked the use of statistical significance
and advocated hypothesis testing based on the prior specification of a
critical effect size, "the minimum value . . . that the researcher has defined
as educationally, or scientifically or practically significant" (p. 60). Yet,
they also suggested the continuation of power analyses and the reporting of
probability levels. J. Cohen (1990) (to whom, of course, much of the emphasis
on power analysis can be attributed, based on his 1969 book), laid out a
devastating criticism of statistical tests of the null hypothesis, but then
went on to advocate power analysis. His discussion illustrates that once one
has accepted the lack of information provided by tests of statistical
significance, power analysis is a vacuous intellectual game.

It is difficult to argue with J. Cohen's (1990) contention that
researchers should plan their research, or with his proposal that a "tentative
informed judgment" be made about the population effect size under
investigation. What is questionable is his recommendation that the planning
include the risk in regard to a Type I error that the researcher is willing to
accept and the statistical power that is desired, so that once these items
have been specified, "it is a simple matter to determine the sample size you
need". Moreover, if "the sample size is beyond your resources, consider the
possibility of reducing your power demand or, perhaps, the effect size, or
even (heaven help us) increasing your alpha level" (p. 1310). What is the
purpose of power analysis and the arbitrary manipulation of criteria in order
to help ensure that the researcher will obtain a desired level of
probability, when statistical significance has so little meaning?*

* In that context, concern over whether to state a directional or
nondirectional alternative hypothesis (Pillemer, 1991) seems to me to be an
equally empty exercise.

S. A. Cohen and Hyman (1981) also recommended the same sort of
intellectual game playing; that is, after "usually" setting a critical effect
size, they then "set n which is often a feasibility decision, . . . then
juggle alpha and beta error risks to try to maximize power as close to 80%
(beta error = .20%) setting alpha at .05, .06, .10, etc., depending on what
the power tables tell us . . ." (p. 53). Why "focus on Critical ES rather
than on statistical significance without giving up the latter [underlining in
the original]" (p. 64)?

If effect sizes are important because statistical significance
(probability) is not an adequate indicator of the magnitude of result (or much
of anything else), why play the game of adjusting research specifications so
that a statistically significant result can be obtained if a prespecified
effect size is obtained? As Carver (1978) pointed out: "Researchers should
ignore statistical significance testing when designing research; a study with
results that cannot be meaningfully interpreted without looking at the p
values is a poorly designed study" (p. 394).

Another possible illustration of paradigmatic influence was mentioned
earlier: Rosenthal's (1991) recommendation that, along with effect sizes,
statistical significance and the statistical power of tests of statistical
significance be reported in replications. That makes little sense. The
question of interest is whether an effect size of a magnitude judged to be
important has been consistently obtained across replications of adequate
fidelity, not whether the result from a replication was statistically
significant or whether the design had adequate power for a result to be
statistically significant. Carver (1978) was correct: "Replicated results
automatically make statistical significance unnecessary" (p. 393).




Another explanation for the persistence of statistical significance
testing is the influence of journal editors. It is commonly claimed (see,
e.g., Kupfersmid, 1988; Neuliep, 1991) that researchers tend not to submit
results that do not reach the .05 level of statistical significance often
because they deem such findings to be unimportant, but also because they
believe that journal reviewers and editors are unlikely to accept the results,
even with accompanying effect sizes. It is also believed that journal editors
have policies against publishing replication studies.

Publication is crucial to success in the academic world. Researchers
shape their research, as well as the manuscripts reporting the research,
according to accepted ways of thinking about analysis and interpretation and
to fit their perceptions of what is publishable. To break from the mold might
be courageous but, at least for the untenured faculty member with some
commitment to self-interest, foolish. As gatekeepers to the publishing realm,
journal editors have tremendous power. As Carver (1978) and Kupfersmid
(1988) have argued, a decline in statistical significance testing is not
likely to occur until journal editors take a stand against it.

What Should Editors Do? 

Editors are, of course, caught up in the tangled web of influences that
have been sparsely discussed above. They are not immune to the allure and
emotional power of ritual and they may themselves be somewhat in awe of the
high priests, perhaps because editors, too, may not be comfortable in their
understanding of or capacity to challenge tests of statistical significance.
Like the rest of us, editors have difficulty separating themselves from their
educational rearing and do not like to take public stances that might result
in opprobrium from their academic colleagues.

It may be unrealistic, as Carver (1978) recognized, to expect journal
editors to become crusaders for an agnostic, if not atheistic, approach to
tests of statistical significance. What editors can accomplish is certainly
limited by the research ecology within which they work. They should, however,
be familiar with the longstanding criticisms of statistical significance
testing, such as cited at the beginning of this paper, and the implications
for editorial policy and practice should be given serious consideration. One
result should be policy that does not rely on the assumed epistemological
validity of tests of statistical significance and the off-shoot, power
analysis (as Thompson's, 1987, proposed model does). Following are guidelines
for such a policy and for editorial practice.
Statistical Significance

In general, authors should be encouraged, even required, to minimize
statistical significance in their analyses and interpretations. Reports of
tests of statistical significance that are not based on randomized samples
(randomly selected and/or randomly assigned) should not be published. For
those studies in which randomization was an element in the design, authors
should be restrained from interpretations that go beyond the legitimate
conclusion in regard to whether the result is a likely occurrence assuming the
null hypothesis to be true. The use of tests of statistical significance to
reject the null hypothesis, to declare a particular result to be nonchance
under the null hypothesis, to indicate the probability that an alternative
hypothesis is true, as indicators of the probability that results are
replicable, as measures of the magnitude of the result, or as indications of
treatment effectiveness should not be accepted.






Even with randomness, authors should be required to state expectations
for (hypotheses about) quantitative results as critical effect sizes (S. A.
Cohen & Hyman, 1981). (Effect sizes other than the standardized mean
difference are available [see, e.g., J. Cohen, 1988; Shaver, 1991].) Effect
sizes should be required in reporting the findings from quantitative research,
along with descriptive statistics (and verbal descriptions) that will help
readers to interpret results and to evaluate the authors' interpretations and
conclusions about the populations and settings to which the study findings
might be generalized. In short, studies should be published without tests of
statistical significance, but not without effect sizes. And, it should be
made clear that, with effect sizes specified, power analysis is not relevant.

Equally important, authors should be required to provide justification
for their interpretations of the educational significance of effect sizes.
There already is a tendency to use criteria, such as J. Cohen's (1988)
standards for small, medium, and large effect sizes, as mindlessly as has been
the practice with the .05 criterion in statistical significance testing.
Consideration of the value of the outcome (including the validity of
assessment) and the human as well as financial costs in producing the outcome
should be mandatory. In addition, authors should be asked to confront the
reproducibility of the result as an element of its educational significance
(Shaver, 1991).

Replication

The mention of reproducibility leads directly to replication, widely
agreed upon as a crucial element of science but largely missing from reports
of educational research (e.g., Shaver & Norton, 1980a, b). Editors should not
only actively encourage the reporting of replications, but in many instances
demand replication before results can be published. In doing so, however,
they should not confuse replication with reviews of research. Even if done in
a meta-analytic framework, reviews are not equivalent to replications, despite
implications to the contrary by some authors (e.g., Bangert-Drowns, 1986, p.
398; Carlberg & Miller, 1984, pp. 9, 10; Fiske, 1983, p. 67; Hedges & Olkin,
1985, p. 3; Hunter & Schmidt, 1990, p. 37; Jackson, 1980, p. 445; Rosnow &
Rosenthal, 1989, pp. 1280-1281; J. Cohen, 1990, p. 1311). As I have noted
elsewhere (Shaver, 1991), "the post hoc assembly of studies that have few, if
any, planned connections and [that have] unknown differences among them is
[not] an adequate substitute for purposely designed replications" (p. 91).
Reviews of the literature have their place, but they do not have the
explanatory power of planned replications.

A proposal for a shift in editorial policy to an emphasis on
replications raises several issues. One is the matter of space: It may take
more journal space to publish study descriptions that are adequate to allow
replication and to describe replications sufficiently to allow the evaluation
of their fidelity. Moreover, with replications given higher priority, fewer
"original" studies could be published, requiring more critical judgments about
study validity and importance. In the long run, the outcome should be better
studies, an increased number of replication studies, and greater knowledge
productivity. However, careful attention to standards will be crucial,
raising other issues to be addressed by journal editors.

All studies are not worth replicating; replication will not make a
trivial study worthwhile. And not all replications of potentially important
studies will be worth publication. Criteria must be specified by which to
judge whether initial studies are nontrivial, when a direct replication is
adequately replicative, when a sufficient number of independent replications
have been published to establish the reliability of a finding, and whether a
systematic replication makes a substantial contribution to generalizability
(e.g., Rosenthal, 1991). Kupfersmid's (1988) proposal that manuscripts first
be submitted to editors without the Results and Discussion sections, so that
the publication decision would be focused on study justification and design
quality, is as relevant to replications as to initial studies. The logistics
involved in handling acceptances and later full submissions may, however,
place unrealistic demands on editors.

Type of Significance

An easily instituted editorial policy, proposed by Carver (1978, p.
395), is to insist that authors always use a modifier for the term
"significant". Authors should not be allowed to make statements such as, "a
confidence interval . . . tells you incidentally whether the effect is
significant ..." (J. Cohen, 1990, p. 1310). As any reader of the literature
is likely to have observed, it is more common to use the term significant
without a modifier than with one.

Insisting upon a modifier would make it more evident to authors that
they should contemplate whether their results have educational or practical
significance as well as statistical significance or, conversely, whether
results lacking statistical significance might not have educational or
practical significance. At the same time, readers would be alerted against
overinterpreting a statistically significant result as necessarily
educationally significant. Sensitivity to the important epistemological
issues underlying the use of tests of statistical significance might also be
increased.
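
The distinction between statistical and practical significance can be made
concrete with a minimal sketch. Everything in it is an assumption made for
illustration: the function name two_group_summary, the 1-point mean
difference, the standard deviation of 15, and the use of a large-sample normal
(z) approximation with a known common standard deviation. None of these
numbers comes from a study discussed here.

```python
import math


def two_group_summary(mean_a, mean_b, sd, n_per_group):
    """Return (d, z, p) for two equal-sized groups sharing a common SD.

    d is the standardized mean difference (an effect-size index),
    z is the large-sample test statistic, and p is the two-tailed
    probability of a |z| at least this large under the null hypothesis.
    Hypothetical helper; assumes the SD is known and shared by both groups.
    """
    d = (mean_a - mean_b) / sd
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    z = (mean_a - mean_b) / standard_error
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-tailed normal probability
    return d, z, p


# Hypothetical numbers: a 1-point difference on a scale with SD = 15,
# evaluated at two very different sample sizes.
for n in (20, 10_000):
    d, z, p = two_group_summary(101.0, 100.0, 15.0, n)
    print(f"n per group = {n:>6}: d = {d:.3f}, z = {z:.2f}, p = {p:.4f}")
```

Under these assumptions the effect size is identical in both runs (d is about
0.07), yet the result falls short of the .05 level with 20 cases per group and
passes it easily with 10,000 per group. The modifier "statistical" is what
tells the reader that only the second kind of statement is being made.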

Another rather modest editorial action that might have important
outcomes would be to delete misleading examples from guidelines for authors.
For example, as Becker (1991) pointed out, the Publication Manual of the
American Psychological Association (1983) does not provide a good model of
reporting, with horrendous examples such as, "the first grade girls reported a
significantly greater liking for school" (p. 81) and "the analysis of variance
indicated a significant retention interval effect" (p. 81). Moreover, the
Manual discourages "the reporting of negative results" (p. 19), promoting the
idea that only statistically significant results are important.

Conclusion

J. Cohen (1990, p. 1311) suggested that changes in the way people
conceptualize, conduct, and report research take time, as evidenced by the
over 40 years that elapsed before Gosset's t test made its way into
statistics textbooks. However, criticisms of statistical significance testing
have been around longer than 40 years and the practice seems to continue
unabated, despite a slight increase in the reporting of effect sizes. It is
probably too much to hope that 10 years hence the participants in an AERA
symposium on tests of statistical significance will be able to cite evidence
that the criticisms are having a noticeable impact on educational researchers'
planning of analysis and reporting of quantitative results.

The reasons for the persistence of tests of statistical significance are
complex. The blame for the continuation of their use cannot be laid on
journal editors alone. Nevertheless, given the role of editors in determining
what gets published, they could be a powerful influence for more rational use
of tests of statistical significance. Possible actions range from a rather
modest insistence on the use of a modifier with the term significance to the
more far-reaching policy that reports of tests of statistical significance be
restricted to those based on randomized samples. Researchers would be
encouraged to analyze more carefully the importance of their results and to
demonstrate empirically the reliability and generalizability of study
outcomes.

There are sufficient logical grounds, laid out in authoritative
criticisms, on which to base changes in editorial policy to minimize, if not
eliminate, the use of tests of statistical significance. No doubt action to
limit the inappropriate use of tests of statistical significance would be
[...] for editors, given the strength of statistical significance testing
in the dominant educational research paradigm and the social and psychological
pressures that encourage the continuation of the inferential ritual. Journal
publication is, however, the most focused point in the process of research
production and reporting, and editors have an historic opportunity to
influence the course of educational and psychological research and the
production of research-based knowledge in a way not available to researchers,
textbook authors, or statistics professors. Modifications in publication
policies have great potential for effecting the changes in the use of
statistical significance that have been called for over and over but largely
ignored. The question is, who will lead the way?

References

Bakan, D. (1967). On method: Toward a reconstruction of psychological
    investigation. San Francisco: Jossey-Bass.
Bangert-Drowns, R. L. (1986). Review of developments in meta-analytic
    methods. Psychological Bulletin, 99, 388-399.
Barber, B. (1961). Resistance by scientists to scientific discovery.
    Science, 134, 596-602.
Becker, G. (1991). Alternative methods of reporting research results.
    American Psychologist, 46(6), 654-655.
Berk, R. A., & Brewer, M. (1978). Feet of clay in hobnail boots: An
    assessment of statistical inference in applied research. Evaluation
    Studies Review Annual, 3, 190-214.
Bracey, G. W. (1991). Sense, non-sense, and statistics. Phi Delta Kappan,
    73(4), 335.
Brewer, J. K. (1985). Behavioral statistics textbooks: Source of myths and
    misconceptions? Journal of Educational Statistics, 10(3), 252-268.
Campbell, K. E., & Jackson, T. T. (1979). The role of and need for
    replication research in social psychology. Replications in Social
    Psychology, 1(1), 3-14.
Carlberg, C. G., & Miller, T. L. (1984). Introduction. The Journal of
    Special Education, 9-10.
Carver, R. P. (1978). The case against statistical significance testing.
    Harvard Educational Review, 48(3), 378-399.
Cohen, J. (1969). Statistical power analysis for the behavioral sciences.
    Hillsdale, NJ: Lawrence Erlbaum.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences
    (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.
Cohen, J. (1990). Things I have learned (so far). American Psychologist,
    45(12), 1304-1312.
Cohen, J., & Cohen, P. (1983). Applied multiple regression/correlation
    analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence
    Erlbaum.
Cohen, S. A., & Hyman, J. S. (1981, April). Testing research hypotheses with
    [...] of statistical significance in educational research. Paper
    presented at the annual conference of the American Educational
    Research Association, Los Angeles, CA.
Cronbach, L. J. (1975). Beyond the two disciplines of scientific psychology.
    American Psychologist, 30, 116-127.
Dar, R. (1987). Another look at Meehl, Lakatos, and the scientific practices
    of psychologists. American Psychologist, 42(2), 145-151.
Ferguson, G. A., & Takane, Y. (1989). Statistical analysis in psychology and
    education (6th ed.). New York: McGraw-Hill.
Festinger, L. (1957). A theory of cognitive dissonance. Evanston, IL: Row,
    Peterson.
Fiske, D. W. (1983). The meta-analytic revolution in outcome research.
    Journal of Consulting and Clinical Psychology, 51, 65-70.
Glass, G. V (1976). Primary, secondary, and meta-analysis of research.
    Educational Researcher, 5(10), 3-8.
Glass, G. V, & Hopkins, K. D. (1984). Statistical methods in education and
    psychology (2nd ed.). Englewood Cliffs, NJ: Prentice-Hall.
Glass, G. V, McGaw, B., & Smith, M. L. (1981). Meta-analysis in social
    research. Beverly Hills, CA: Sage.
Harcum, E. R. (1989). The highly inappropriate calibrations of statistical
    significance. American Psychologist, 44(6), 964.
Hays, W. L. (1973). Statistics for the social sciences (2nd ed.). New York:
    Holt, Rinehart and Winston.
Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis.
    New York: Academic Press.
Hunter, J. E., & Schmidt, F. L. (1990). Methods of meta-analysis:
    Correcting error and bias in research findings. Newbury Park, CA: Sage.
Jackson, G. B. (1980). Methods for integrative reviews. Review of
    Educational Research, 50, 438-460.
Kupfersmid, J. (1988). Improving what is published: A model in search of an
    editor. American Psychologist, 43(8), 635-642.
Lightman, A., & Gingerich, O. (1992). When do anomalies begin? Science,
    255, 690-695.
McHugh, R. B. (1964). Need the randomized block design be replicated? The
    Journal of Experimental Education, 31(2), 169-174.
Meehl, P. E. (1967). Theory-testing in psychology and physics: A
    methodological paradox. Philosophy of Science, 34, 103-115.
Messick, S. (1989). Meaning and values in test validation: The science and
    ethics of assessment. Educational Researcher, 18(2), 5-11.
Morrison, D. E., & Henkel, R. E. (Eds.). (1970). The significance test
    controversy: A reader. Chicago: Aldine.
Neuliep, J. W. (Ed.). (1991). Replication research in the social sciences.
    Newbury Park, CA: Sage.
Pillemer, D. B. (1991). One- versus two-tailed hypothesis tests in
    contemporary educational research. Educational Researcher, 20(9), 13-17.
Rosenthal, R. (1984). Meta-analytic procedures for social research. Beverly
    Hills, CA: Sage.
Rosenthal, R. (1991). Replication in behavioral research. In J. W. Neuliep
    (Ed.), Replication research in the social sciences (pp. 1-30). Newbury
    Park, CA: Sage.
Rosnow, R. L., & Rosenthal, R. (1989). Statistical procedures and the
    justification of knowledge in psychological science. American
    Psychologist, 44, 1276-1284.
Salsburg, D. S. (1985). The religion of statistics as practiced in medical
    journals. The American Statistician, 39(3), 220-223.
Schlaefli, A., Rest, J. R., & Thoma, S. J. (1985). Does moral education
    improve judgment? A meta-analysis of intervention studies using the
    Defining Issues Test. Review of Educational Research, 55, 319-352.
Seeman, J. (1973). On supervising student research. American Psychologist,
    28, 900-905.
Shaver, J. P. (1979). The productivity of educational research and the
    applied-basic research distinction. Educational Researcher, 8(1), 3-9.
Shaver, J. P. (1980, April). Readdressing the role of statistical tests of
    significance. Paper presented at the annual meeting of the American
    Educational Research Association, Boston, MA.
Shaver, J. P. (1985a). Chance and nonsense: A conversation about
    interpreting tests of statistical significance. Part 1. Phi Delta Kappan,
    67, 57-60.
Shaver, J. P. (1985b). Chance and nonsense: A conversation about
    interpreting tests of statistical significance. Part 2. Phi Delta Kappan,
    67, 138-141. Erratum, 1986, 67, 624.
Shaver, J. P. (1987, October). Mandates for curriculum and instruction from
    educational and psychological research. Keynote address presented at the
    annual meeting of the Educational Research Association, Park City, UT.
Shaver, J. P. (1991). Quantitative reviewing of research. In J. P. Shaver
    (Ed.), Handbook of research on social studies teaching and learning (pp.
    83-95). New York: Macmillan.
Shaver, J. P., & Norton, R. S. (1980a). Populations, samples, randomness,
    and replication in two social studies journals. Theory and Research in
    Social Education, 8(2), 1-20.
Shaver, J. P., & Norton, R. S. (1980b). Randomness and replication in ten
    years of the American Educational Research Journal. Educational
    Researcher, [...].
Signorelli, A. (1974). Statistics: Tool or master of the psychologist?
    American Psychologist, 29, 774-777.
Skinner, B. F. (1956). A case history in scientific method. American
    Psychologist, 11, 221-233.
Thompson, B. (1987, April). The use (and misuse) of statistical significance
    testing: Some recommendations for improved editorial policy and practice.
    Paper presented at the annual meeting of the American Educational Research
    Association, Washington, DC.
Thompson, B. (1989). The place of qualitative research in contemporary
    social science: The importance of post-paradigmatic thought. Advances in
    Social Science Methodology: A Research Annual, 1, 1-42.
Tukey, J. W. (1969). Analyzing data: Sanctification or detective work?
    American Psychologist, 24(2), 83-91.
Winch, R. F., & Campbell, D. T. (1969). Proof? No. Evidence? Yes. The
    significance of tests of significance. American Sociologist, 4, 140-143.
Winer, B. J., Brown, D. R., & Michels, K. M. (1991). Statistical principles
    in experimental design (3rd ed.). New York: McGraw-Hill.
Wolf, F. M. (1986). Meta-analysis: Quantitative methods for research
    synthesis. Beverly Hills, CA: Sage.